
    Using AVX2 Instruction Set to Increase Performance of High Performance Computing Code

    In this paper we discuss the new Intel instruction set extensions, Intel Advanced Vector Extensions 2 (AVX2), and what they bring to high performance computing (HPC). To illustrate this, new systems utilizing AVX2 are evaluated to demonstrate how to effectively exploit AVX2 for HPC types of code, and to expose situations in which AVX2 might not be the most effective way to increase performance.
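
    As a rough illustration of the kind of vectorization AVX2 enables, the sketch below uses 256-bit fused multiply-add intrinsics for a double-precision y = a*x + y loop. The function name and array layout are assumptions for illustration, not code from the paper; it assumes a compiler with AVX2/FMA support (e.g. gcc -O2 -mavx2 -mfma).

        #include <immintrin.h>
        #include <stddef.h>

        /* y = a*x + y over double arrays using 256-bit AVX2/FMA vectors. */
        void fma_avx2(double *y, const double *x, double a, size_t n)
        {
            __m256d va = _mm256_set1_pd(a);        /* broadcast the scalar a */
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {           /* 4 doubles per vector   */
                __m256d vx = _mm256_loadu_pd(x + i);
                __m256d vy = _mm256_loadu_pd(y + i);
                vy = _mm256_fmadd_pd(va, vx, vy);  /* single FMA instruction */
                _mm256_storeu_pd(y + i, vy);
            }
            for (; i < n; i++)                     /* scalar remainder loop  */
                y[i] = a * x[i] + y[i];
        }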

    Effective Implementation of DGEMM on Modern Multicore CPU

    In this paper we present a detailed study on tuning double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon E5-2680 CPU. We selected an optimal algorithm from the instruction set perspective, as well as software tools optimized for Intel Advanced Vector Extensions (AVX). Our optimizations included the use of vector memory operations and AVX instructions. Our proposed algorithm achieves a performance improvement of 33% compared to the latest results achieved using the Intel Math Kernel Library DGEMM subroutine.
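
    A minimal sketch of the register-blocking idea behind such an AVX DGEMM kernel is shown below: a 1x4 block of C is kept in a 256-bit register while streaming through one row of A and the corresponding rows of B. This illustrates the technique, not the paper's actual kernel; row-major storage with leading dimension n and the function name are assumptions.

        #include <immintrin.h>

        /* Accumulate one 1x4 block of C = A*B (row i, columns j..j+3),
           keeping the block in a register across the whole k loop.
           Sandy Bridge AVX has no FMA, hence the separate mul and add. */
        void dgemm_1x4_block(const double *A, const double *B, double *C,
                             int n, int i, int j)
        {
            __m256d c = _mm256_loadu_pd(&C[i * n + j]);
            for (int k = 0; k < n; k++) {
                __m256d a = _mm256_broadcast_sd(&A[i * n + k]); /* A(i,k)   */
                __m256d b = _mm256_loadu_pd(&B[k * n + j]);     /* B(k,j..) */
                c = _mm256_add_pd(c, _mm256_mul_pd(a, b));
            }
            _mm256_storeu_pd(&C[i * n + j], c);
        }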

    InterCriteria Analysis of ACO Start Strategies


    Adaptation of MPDATA Heterogeneous Stencil Computation to Intel Xeon Phi Coprocessor

    The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations. MPDATA is one of the major parts of the dynamic core of the EULAG geophysical model. In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. In order to utilize available computing resources, we propose the (3 + 1)D decomposition of MPDATA heterogeneous stencil computations. This approach is based on a combination of the loop tiling and loop fusion techniques. It allows us to ease memory/communication bounds and better exploit the theoretical floating-point efficiency of target computing platforms. An important method of improving the efficiency of the (3 + 1)D decomposition is partitioning the available cores/threads into work teams, which reduces inter-cache communication overheads. This method also increases opportunities for the efficient distribution of MPDATA computation onto available resources of the Intel MIC architecture, as well as Intel CPUs. We discuss preliminary performance results obtained on two hybrid platforms, each containing two CPUs and an Intel Xeon Phi. The top-of-the-line Intel Xeon Phi 7120P gives the best performance results, and executes MPDATA almost 2 times faster than two Intel Xeon E5-2697v2 CPUs.
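
    The loop-tiling part of such a scheme can be sketched as follows for a generic 3D stencil: the i/j space is split into cache-sized tiles that OpenMP threads (the "work teams") process independently, streaming along k. The 7-point stencil, the grid sizes, and the tile sizes are illustrative assumptions only; the actual MPDATA kernels are considerably more involved.

        #include <omp.h>

        #define NX 128
        #define NY 128
        #define NZ 128
        #define TI 16              /* tile sizes chosen to fit in cache */
        #define TJ 16

        /* One sweep of a 7-point stencil with i/j tiling; each thread
           works on a cache-resident tile and streams along k. */
        void stencil_sweep(const double in[NX][NY][NZ], double out[NX][NY][NZ])
        {
            #pragma omp parallel for collapse(2) schedule(static)
            for (int ii = 1; ii < NX - 1; ii += TI)
                for (int jj = 1; jj < NY - 1; jj += TJ)
                    for (int i = ii; i < ii + TI && i < NX - 1; i++)
                        for (int j = jj; j < jj + TJ && j < NY - 1; j++)
                            for (int k = 1; k < NZ - 1; k++)
                                out[i][j][k] = (in[i-1][j][k] + in[i+1][j][k]
                                              + in[i][j-1][k] + in[i][j+1][k]
                                              + in[i][j][k-1] + in[i][j][k+1]) / 6.0;
        }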

    Steering Customized AI Architectures for HPC Scientific Applications

    AI hardware technologies have revolutionized computational science. While they have mostly been used to accelerate deep learning training and inference models for machine learning, HPC scientific applications do not seem to directly benefit from these specific hardware features unless AI-based components are introduced into their simulation workflows, for instance as a replacement for their numerical solvers. This paper proposes to take another direction in an attempt to democratize customized AI architectures for HPC scientific computing. The main idea is to demonstrate how legacy applications can leverage these AI engines after a necessary algorithmic redesign. It is critical that the resulting software implementations map onto the underlying memory-austere hardware architectures to extract the expected performance. To facilitate this process, we promote the matricization technique for restructuring codes (1) by exploiting data sparsity via algebraic compression and (2) by expressing the critical computational phases in terms of tile low-rank matrix-vector multiplications (TLR-MVM) and batch matrix-matrix multiplications (batch GEMM). Algebraic compression reduces the memory footprint so that data fits into small local cache/memory, while batch execution ensures high occupancy. We highlight how we can steer the Graphcore AI-focused Wafer-on-Wafer Intelligence Processing Units (IPUs) to deliver high performance for both operations. We conduct a performance benchmarking campaign of these two matrix operations, which account for most of the elapsed time of four real applications in computational astronomy, seismic imaging, wireless communications, and climate/weather predictions. We report bandwidth and execution rates with speedup factors up to 150X/14X/25X/40X, respectively, on IPUs compared to other systems.
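
    The batch GEMM pattern referred to above can be sketched as follows: many small, independent C[b] += A[b] * B[b] products are executed together so that wide hardware stays fully occupied. On real systems this maps to a vendor batched routine (e.g. MKL's cblas_dgemm_batch, or the IPU/GPU equivalents); the naive kernel and the dimensions below are illustrative assumptions only.

        #include <stddef.h>

        enum { M = 8, N = 8, K = 8, BATCH = 1024 };   /* illustrative sizes */

        /* Many small independent GEMMs: C[b] += A[b] * B[b] for each batch
           entry b; the batch entries can run concurrently on wide hardware. */
        void batch_gemm(const double A[BATCH][M][K],
                        const double B[BATCH][K][N],
                        double       C[BATCH][M][N])
        {
            for (size_t b = 0; b < BATCH; b++)
                for (size_t i = 0; i < M; i++)
                    for (size_t j = 0; j < N; j++) {
                        double acc = C[b][i][j];
                        for (size_t k = 0; k < K; k++)
                            acc += A[b][i][k] * B[b][k][j];
                        C[b][i][j] = acc;
                    }
        }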